ByteDance Seed & Tsinghua University
2025/04/26
Reinforcement Learning (RL) for LLM Post-Training can typically be modeled as a dataflow graph, consisting of:
In practice, we should implement the dataflow graph as an execution pattern on a GPU cluster.
From 0.2.0.post2 until now (after 0.3.0.post1), we have achieved a speedup of ~1.4x on the DAPO (w/o dynamic sampling) workload.
verl introduces a hybrid-controller paradigm, consisting of
- a single-controller (RayPPOTrainer) that concentrates the training control logic in a single process
- multi-controllers (ActorRolloutWorker) that conduct the distributed computation in a complex but efficient way
Thanks to the single-controller programming model, verl allows implementing different RL algorithms by modifying only a few lines, usually only in the fit function.
# PPO
for prompts in dataloader:
    # Stage 1: Sampling Trajectories
    batch = actor.generate_sequences(prompts)
    # Stage 2: Preparing Experiences
    batch = reward.compute_reward(batch)
    batch = reference.compute_log_prob(batch)
    batch = critic.compute_values(batch)
    batch = compute_advantage(batch, "gae")
    # Stage 3: Training
    critic.update_critic(batch)
    actor.update_actor(batch)

# GRPO (no critic, so no value computation or critic update)
for prompts in dataloader:
    # Stage 1: Sampling Trajectories
    batch = actor.generate_sequences(prompts)
    # Stage 2: Preparing Experiences
    batch = reward.compute_reward(batch)
    batch = reference.compute_log_prob(batch)
    batch = compute_advantage(batch, "grpo")
    # Stage 3: Training
    actor.update_actor(batch)

The optimal execution patterns for different workloads, e.g., training and generation, usually differ.
Instead of splitting the devices to deploy different engines separately for different workloads, which causes many bubbles,
verl implements a hybrid engine that can switch between the different procedures on the same cluster, fully utilizing all the GPUs.
Thanks to the hybrid engine, verl allows flexibly switching between different parallelism strategies to achieve the optimal performance.
Generation:
Training & Inference:
Data Parallelism (DP) like FSDP is the most commonly used parallelism strategy.
However, DP performance can be degraded by load imbalance, which is especially severe in long-context training.
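To make the load-imbalance problem concrete, the following minimal sketch compares per-rank token totals when samples are dispatched to DP ranks in their original order versus after a greedy reordering. The helper names (naive_partition, greedy_balanced_partition) and the sequence lengths are hypothetical, and the longest-first greedy heuristic is a generic illustration, not necessarily the algorithm verl uses.

```python
import heapq


def naive_partition(seq_lens, num_ranks):
    """Split samples into contiguous, equal-sized chunks, one per rank."""
    if len(seq_lens) % num_ranks != 0:
        raise ValueError("batch size must be divisible by the number of ranks")
    per_rank = len(seq_lens) // num_ranks
    return [seq_lens[i * per_rank:(i + 1) * per_rank] for i in range(num_ranks)]


def greedy_balanced_partition(seq_lens, num_ranks):
    """Assign the longest samples first to the rank with the fewest tokens
    so far, while keeping the number of samples per rank equal."""
    if len(seq_lens) % num_ranks != 0:
        raise ValueError("batch size must be divisible by the number of ranks")
    per_rank = len(seq_lens) // num_ranks
    heap = [(0, rank) for rank in range(num_ranks)]  # (token total, rank id)
    heapq.heapify(heap)
    parts = [[] for _ in range(num_ranks)]
    for length in sorted(seq_lens, reverse=True):
        total, rank = heapq.heappop(heap)
        parts[rank].append(length)
        # A rank that is already full is not pushed back onto the heap.
        if len(parts[rank]) < per_rank:
            heapq.heappush(heap, (total + length, rank))
    return parts


# Hypothetical token counts for a batch of 8 samples on 2 DP ranks.
seq_lens = [4096, 3800, 3500, 3000, 500, 400, 300, 200]
naive = naive_partition(seq_lens, 2)
balanced = greedy_balanced_partition(seq_lens, 2)
print([sum(p) for p in naive])     # [14396, 1400]
print([sum(p) for p in balanced])  # [7896, 7900]
```

In the naive split, one rank processes roughly ten times as many tokens as the other, so the lighter rank idles at every synchronization point. After reordering, the totals differ by only a few tokens.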
verl implements the following features to improve load balance:
- balance_batch: make the token counts of the samples dispatched to each DP rank as balanced as possible by reordering the samples in each batch.
However, with gradient accumulation, it’s not enough to balance only the total number of tokens per rank in a batch, since DP syncs in units of micro batches.
So here comes the second feature:
- use_dynamic_bsz: divide the batch into micro batches such that the token counts of the micro batches are as balanced as possible.
Other config options:
- use_remove_padding: verl can save computation by removing padding tokens, based on Flash Attention 2.
- enable_gradient_checkpointing
- use_torch_compile
- use_liger
- lora_rank etc.
A canonical RL dataset in verl has the following fields:
- prompt: a list of messages {"role": "...", "content": "..."}
- data_source: used to choose the reward function
- reward_model: a dict containing
  - "ground_truth"
  - "style" like "model" or "rule"
- extra_info: a dict containing extra information
For examples, please check examples/data_preprocess.
For further customization, verl provides the data.custom_cls config.
The custom dataset class defined in the .py file is required to accept the following initialization parameters:
verl allows defining a custom reward function via the custom_reward_function config.
The custom reward function defined in the .py file is required to accept the parameters passed from the reward manager __call__ method. For example, the NaiveRewardManager is defined as follows:
To implement more complex features, you might also want to directly add a new reward manager like PRIMERewardManager or DAPORewardManager.
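A rule-based custom reward function might look like the sketch below. The function name compute_score and its parameters (data_source, solution_str, ground_truth, extra_info) are assumptions for illustration, chosen to mirror the dataset fields described earlier; the actual parameters are whatever the reward manager's __call__ method passes, so check it before relying on this signature. The "Answer:" output convention is also hypothetical.

```python
import re


def extract_answer(solution_str):
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", solution_str)
    return matches[-1].strip() if matches else None


def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return 1.0 if the extracted answer exactly matches the ground truth,
    else 0.0. `data_source` and `extra_info` are accepted but unused here."""
    answer = extract_answer(solution_str)
    if answer is None:
        return 0.0
    return 1.0 if answer == str(ground_truth).strip() else 0.0


# Hypothetical usage with a made-up model response.
score = compute_score("example/math", "2 + 3 = 5\nAnswer: 5", "5")
print(score)  # 1.0
```

Taking the last "Answer:" marker rather than the first is a design choice in this sketch: it scores the model's final answer even if intermediate reasoning also contains the marker.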
To modify the loss function, the most convenient way is to
- search for the .backward() call
- modify functions like compute_policy_loss
- or add loss terms like entropy_loss
For example, the DataParallelPPOActor.update_policy method defines the loss function as follows:
class DataParallelPPOActor(BasePPOActor):
    def update_policy(self, data: DataProto):
        # ... (forward pass producing log_prob, entropy, etc. is elided)
        pg_loss = compute_policy_loss(
            old_log_prob=old_log_prob, log_prob=log_prob,
            advantages=advantages, # ...
        )
        entropy_loss = agg_loss(loss_mat=entropy)
        policy_loss = pg_loss - entropy_loss * entropy_coeff
        kld = kl_penalty(
            logprob=log_prob, ref_logprob=ref_log_prob, # ...
        )
        kl_loss = agg_loss(loss_mat=kld)
        policy_loss = policy_loss + kl_loss * self.config.kl_loss_coef
        # ... (`loss` is derived from `policy_loss`; details elided)
        loss.backward()

As mentioned above, the main training logic is concentrated in the fit function of trainer classes like RayPPOTrainer.
For example, the RayDAPOTrainer class overrides the fit function to implement the “dynamic sampling” feature:
class RayDAPOTrainer(RayPPOTrainer):
    def fit(self):
        # Initialization of `num_gen_batches` and `num_prompt_in_batch` is elided in this excerpt.
        for epoch in range(self.config.trainer.total_epochs):
            batch = None
            for batch_dict in self.train_dataloader:
                new_batch = DataProto.from_single_dict(batch_dict)
                num_gen_batches += 1
                # `gen_batch` is prepared in code elided here.
                gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
                new_batch = new_batch.union(gen_batch_output)
                if not self.config.algorithm.filter_groups.enable:
                    batch = new_batch
                else:
                    # Getting `kept_traj_idxs` ...
                    new_batch = new_batch[kept_traj_idxs]
                    batch = new_batch if batch is None else DataProto.concat([batch, new_batch])
                    prompt_bsz = self.config.data.train_batch_size
                    if num_prompt_in_batch < prompt_bsz:
                        # Not enough prompts kept yet: keep generating.
                        max_num_gen_batches = self.config.algorithm.filter_groups.max_num_gen_batches
                        if max_num_gen_batches <= 0 or num_gen_batches < max_num_gen_batches:
                            continue
                    else:
                        # Enough prompts collected: truncate to the target trajectory batch size.
                        traj_bsz = self.config.data.train_batch_size * self.config.actor_rollout_ref.rollout.n
                        batch = batch[:traj_bsz]
                # ...

verl is close to completing support for efficient RL training of huge MoE models like DeepSeek-V3-671B, based on the following features:
- GPTModel class for actor and critic
For more details, please check our PR #708.
The awesome SGLang RL team
- OpenAIFunctionTool with end-to-end training
For more details, please check their PR #1037.
Besides, our team also integrates the async engine based on vLLM V1 AsyncLLM. Kudos to Xibin Wu for his great work!
For related resources like
etc., please scan the QR code:
verl: Flexible and Efficient RL for LLMs